feat(embeddings): graph-enriched strategy + context window overflow detection #74
Greptile Summary
Adds a graph-enriched embedding strategy that uses dependency graph context (callers/callees/parameters/comments) instead of raw source code, dramatically reducing token usage from ~360 to ~100 tokens per symbol while improving semantic search quality. Introduces context window overflow detection that estimates tokens, truncates oversized text, and records truncation metadata. Major changes:
Confidence Score: 5/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[buildEmbeddings] --> B{Strategy?}
B -->|structured| C[Prepare graph queries<br/>calleesStmt, callersStmt]
B -->|source| D[Skip graph prep]
C --> E[Loop through nodes by file]
D --> E
E --> F{Strategy?}
F -->|structured| G[buildStructuredText<br/>Extract params, graph context, comments]
F -->|source| H[buildSourceText<br/>Extract raw source lines]
G --> I[estimateTokens]
H --> I
I --> J{Tokens > contextWindow?}
J -->|Yes| K[Truncate text<br/>Increment overflowCount]
J -->|No| L[Keep text as-is]
K --> M[Add to texts array]
L --> M
M --> N{More nodes?}
N -->|Yes| E
N -->|No| O{overflowCount > 0?}
O -->|Yes| P[warn user about truncation]
O -->|No| Q[Continue]
P --> R[embed texts]
Q --> R
R --> S[Store embeddings + metadata<br/>including strategy and truncated_count]
Last reviewed commit: d5538b7
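For readers following the flowchart, the overflow branch (estimateTokens, then truncate and increment overflowCount) can be sketched roughly as below. This is a minimal sketch under assumptions, not the PR's code: the 4-characters-per-token heuristic, the helper name fitToContextWindow, and the character-based truncation are all hypothetical, and only estimateTokens, contextWindow, and overflowCount are names taken from the flowchart.

```javascript
// Hypothetical sketch: names mirror the flowchart, bodies are assumptions.

// Assumed heuristic: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper: truncate each text that exceeds the context window
// and count how many texts were cut.
function fitToContextWindow(texts, contextWindow) {
  const maxChars = contextWindow * CHARS_PER_TOKEN;
  let overflowCount = 0;
  const fitted = texts.map((text) => {
    if (estimateTokens(text) > contextWindow) {
      overflowCount += 1;
      return text.slice(0, maxChars);
    }
    return text;
  });
  return { texts: fitted, overflowCount };
}

// Made-up inputs: one short text and one that overflows a 10-token window.
const result = fitToContextWindow(['short', 'x'.repeat(100)], 10);
if (result.overflowCount > 0) {
  console.warn(`${result.overflowCount} symbol(s) exceeded the context window and were truncated`);
}
```

Counting overflows in one pass and warning once at the end matches the flowchart, where the warning is issued after the node loop rather than per symbol.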
…etection
Addresses #55 and #56 with a combined solution:
- Add contextWindow to MODELS registry for all 7 embedding models
- New 'structured' embedding strategy (default): uses callers/callees from the dependency graph, extracted parameters, and leading comments instead of raw source code (~100 tokens vs ~360 avg)
- Detect and truncate symbols exceeding model context window, with user-facing warning and truncated_count in embedding_meta
- Add --strategy flag to CLI embed command (structured|source)
- Store strategy in embedding_meta for auditability
- Export estimateTokens() and EMBEDDING_STRATEGIES for programmatic use
Closes #55, closes #56
Impact: 5 functions changed, 3 affected
Standalone script that compares structured vs source strategies against 14 real search queries on the project's own graph. Reports Hit@1/3/5 and per-query rank comparison.
Usage: node tests/search/embedding-benchmark.js [--model minilm]
Impact: 1 functions changed, 0 affected
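As a reference for the Hit@1/3/5 numbers the benchmark reports, here is a minimal sketch of how Hit@k is commonly computed. It is not taken from embedding-benchmark.js: the function name hitAtK and the example ranks are made up for illustration.

```javascript
// Hit@k: fraction of queries whose expected symbol appears within the top k results.
// `ranks` holds the 1-based rank of the expected symbol per query, or null if not found.
function hitAtK(ranks, k) {
  if (ranks.length === 0) return 0;
  const hits = ranks.filter((rank) => rank !== null && rank <= k).length;
  return hits / ranks.length;
}

// Made-up ranks for four hypothetical queries (not benchmark results).
const exampleRanks = [1, 3, 7, null];
for (const k of [1, 3, 5]) {
  console.log(`Hit@${k}: ${hitAtK(exampleRanks, k)}`);
}
```

Running the same computation once per strategy over the same queries gives directly comparable numbers.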
76d0c6f to 56a0517
Claude finished @carlos-alm's task
PR Review Complete ✅
Conducting comprehensive review of the graph-enriched embedding strategy implementation.
Review Checklist
Phase 1: Root Cause Analysis ✅
Problems Being Solved:
Why This Approach is Justified:
This is solving real, documented issues with a well-reasoned architectural approach.
Phase 2: Configuration Change Detection ✅
No concerning configuration changes found:
This is a clean implementation without any rule relaxation.
Phase 3: Test Integrity ✅
Excellent test coverage with 15 comprehensive tests:
No red flags found:
Phase 4: Code Quality Review ✅
Security: ✅ Clean
Error Handling: ✅ Robust
Performance: ✅ Well-optimized
API Design: ✅ Solid
Code Quality: ✅ High
Root Cause Analysis
Problem Being Solved: Context window overflow and poor semantic signal from raw code embeddings. The original strategy embedded raw source averaging ~360 tokens per symbol, which could exceed model limits and gave a noisy semantic signal.
Why This Approach: The graph-enriched structured strategy leverages the dependency graph (codegraph's core value) to create compact (~100 token), semantically rich embeddings from caller/callee relationships, parameters, and comments instead of raw implementation code.
Risk Assessment: Very low risk. Backward compatible, comprehensive tests, no configuration relaxation, proper error handling.
Critical Concerns
None. This is exemplary code with excellent engineering practices.
Final Recommendation
Summary
- structured embedding strategy (new default): embeds callers/callees from the dependency graph, extracted parameters, and leading comments instead of raw source code, producing ~100 tokens per symbol vs ~360 avg, with dramatically better semantic signal for natural language search queries
- MODELS now has a contextWindow field; buildEmbeddings estimates token count, truncates oversized texts, warns the user, and records truncated_count in embedding_meta
- --strategy CLI flag: codegraph embed --strategy structured|source lets users choose between the new graph-enriched approach and the original raw-code strategy
- source strategy and edge cases
Closes #55, closes #56
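To make the structured strategy concrete, the sketch below shows one way a symbol's graph context could be flattened into compact text. It is an illustration under assumptions, not the PR's buildStructuredText: the input object shape (kind, name, comment, params, callees, callers) and the example symbol are hypothetical, and only the "Parameters", "Calls", and "Called by" labels are taken from the test plan.

```javascript
// Hypothetical sketch of a structured-text builder; the real function
// presumably reads callers/callees from the dependency graph instead.
function buildStructuredText(symbol) {
  const lines = [`${symbol.kind} ${symbol.name}`];
  if (symbol.comment) lines.push(symbol.comment);
  if (symbol.params.length > 0) lines.push(`Parameters: ${symbol.params.join(', ')}`);
  if (symbol.callees.length > 0) lines.push(`Calls: ${symbol.callees.join(', ')}`);
  if (symbol.callers.length > 0) lines.push(`Called by: ${symbol.callers.join(', ')}`);
  return lines.join('\n');
}

// Made-up symbol for illustration only.
const text = buildStructuredText({
  kind: 'function',
  name: 'buildEmbeddings',
  comment: 'Build embeddings for every symbol in the graph.',
  params: ['db', 'options'],
  callees: ['estimateTokens', 'embed'],
  callers: ['embedCommand'],
});
console.log(text);
```

The resulting text describes what a symbol is and how it is used rather than how it is implemented, which is the property the summary credits for the smaller token count.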
Test plan
- tests/search/embedding-strategy.test.js (15 tests): structured text includes graph context (Calls/Called by/Parameters/leading comments), source text uses raw code, overflow detection warns + truncates, strategy stored in metadata, defaults to structured
- tests/search/embedder-search.test.js (16 tests): no regressions in search/RRF functionality